ANNOTATION


Stemming is one of the most common initial data processing steps in Natural Language Processing (NLP)
projects. Stemming removes part of a word or shortens the word to its root. Several stemming algorithms are
used to decide where to cut a word. In determining the stem of Uzbek words, problems can occur such as
homonymy of root and suffix with one stem, sound changes when a suffix is added to a word, and stemming of
neologisms and NERs. This article presents models for handling sound changes in words when stemming the
texts of the Uzbek language corpus.


KEYWORDS: Natural Language Processing, NLP, root, stem, sound change, POS tagging, morphological 
analyzer. 


INTRODUCTION. In the Uzbek language, a word is formed by combining a root with suffixes (affixes).
Because phonetic harmony and disharmony occur when affixes are added to the root, analysis of both phonetic
and morphological changes is an important task. When solving NLP tasks, word forms have to be reduced to
the root. Removing all inflectional affixes from a word and reducing the rest of the word to a base form is
considered one of the important tasks of NLP, and this process is referred to as stemming. Stemming is
important in Information Retrieval (IR) systems [1].


In the Uzbek language, where the smallest meaningful part of a word is defined as the root, the stem is the
largest part that gives meaning to a word. Therefore, we can say that a word consists of two parts: the stem,
which carries the meaning, and the suffixes. The stem is the part that remains after the suffixes of the word
form are cut off, and in some cases it may not be meaningful by itself. Also, the stem does not always match
the morphological root of the word.


Stemming is the task of shortening a word to its root by removing the derivational and inflectional suffixes
added to it. The stemming process can be seen as a “rough” heuristic process that simply cuts off the suffixes
of words. According to the authors of [2], unlike lemmatization, the stemming process does not use a
vocabulary or morphological analysis. The root produced by stemming does not necessarily resemble an actual
word or its morphological root. The aim of the stemming process is to reduce similar words to the same root.
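
The "rough" suffix cutting described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the suffix list `SUFFIXES` and the function name `rough_stem` are hypothetical, and the list is far smaller than a real Uzbek suffix inventory.

```python
# Hypothetical, deliberately tiny suffix list; a real stemmer needs a full inventory.
SUFFIXES = ["ning", "lar", "ni", "ga", "da", "im", "i"]


def rough_stem(word, min_stem_len=2):
    """Repeatedly cut the longest matching suffix, with no dictionary lookup."""
    changed = True
    while changed:
        changed = False
        # Try longer suffixes first so that -ning is preferred over -i, etc.
        for suffix in sorted(SUFFIXES, key=len, reverse=True):
            if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
                word = word[: -len(suffix)]
                changed = True
                break
    return word


print(rough_stem("kitoblarni"))  # kitob
print(rough_stem("yuragim"))     # yurag
```

The second call shows why pure suffix cutting is not enough: the result "yurag" is not a dictionary word, because the root "yurak" changed its final consonant before the suffix.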


Words in the Uzbek language are formed by adding suffixes (affixes) to the root. In some cases, phonetic
changes can occur in the word, and these are reflected directly in the text. A root may itself be a word that
expresses a specific meaning. While affixes play an important role in a sentence, they do not have an
independent meaning. Affixes are classified into derivational suffixes and inflectional suffixes [3].
Inflectional suffixes change only the grammatical function of the word. Adding derivational suffixes to the
root can change the meaning of the word. Inflectional suffixes produce syntactic changes in the word. A
derivational suffix is attached to the root first, followed by an inflectional suffix [4].


The number of suffixes that can be added to a word, and the many ways they combine, make identifying the
root a complex problem in agglutinative languages, because in most agglutinative languages the combination
of suffixes produces complex word forms. The counts of derivational and inflectional suffixes are also
calculated differently.


To determine the stem of a word in Uzbek, the root and all kinds of suffixes attached to it are identified.
Traditional stemming algorithms are based on suffix lists and some morphological rules, and the stemming
process may produce an ambiguous stem. In the process of stemming, all types of suffixes in the word are
usually removed. But when stemming is performed in this way, a wrong result can be obtained in some cases
[5].

The following problems may arise when determining the stem of words in the Uzbek language:

> homonymy of root and suffix with one stem;

> sound changes in the word;

> stemming of neologisms and NERs.


This research work presents the problems caused by sound changes in a word when determining the stem of
Uzbek words, together with methods for solving them.


Sound changes in the word 


When a phoneme is realized in speech, its variants manifest differently depending on the conditions of
speech (the influence of adjacent sounds or suffixes). As a result, the sound acquires properties that are not
inherent to the word. Some sounds adapt to the adjacent sound, while others strengthen or alternate with
another sound. A change that occurs in speech in this way is called a combinatorial-positional sound change
[6]. Such changes are common in speech.


This happens when a voiced consonant changes to a voiceless one, a plosive consonant changes to a nasal
consonant, one wide vowel changes to another wide vowel, or some sounds are dropped or added as a result of
attaching auxiliary morphemes to the word. Significantly, some of the changes noted above are reflected in
both pronunciation and writing, while others occur only in oral speech and are not reflected in writing.


Speech sounds that are part of a syllable, word, or phrase affect each other, resulting in sound changes. A
sound change that occurs in the speech stream is called a phonetic process [7]. Therefore, any sound change
occurs as a result of the influence of speech sounds on each other. A detailed study of such phonetic
processes and phenomena opens the way to a deeper study of the content and essence of phonetics and
phonology. In some cases, the addition of inflectional suffixes to the end of the root may result in phonetic
changes in the word (insertion, deletion, phonetic harmony, and assimilation) [4].


[Figure: diagram whose surviving labels are "Tovush tushishi" (sound deletion) and "Tovush almashinishi" (sound alternation)]


Figure 1. Sound changes in the word


In the Uzbek language, three different phonetic changes can occur in a word: assimilation, dissimilation
and metathesis (Table 1).


Table 1. Assimilation, dissimilation and metathesis in Uzbek language


assimilation    dissimilation    metathesis
obro‘y+imiz     me+ning          boshlig‘+i
achch+iq        singl+isi        tarog‘+ini
shun+day        ko‘ngl+im        san+aydi
parvoy+im       bo‘yn+i          yurag+im
un+ga           olt+ovlon        qulog‘+ing


To solve the problem of sound change in stem detection, the boundaries of the root and suffixes are
determined in the first step, and lemmatization is carried out in the second step. As a result of
lemmatization, erroneously generated stems are replaced by a root that exists in the dictionary.
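
The two steps above can be illustrated with a minimal Python sketch. It is not the analyzer's actual implementation: the names `split_word`, `lemmatize`, `STEM_TO_ROOT` and `DICTIONARY` are hypothetical, and the small tables are filled only with examples taken from this article.

```python
# Hypothetical lookup tables built from examples in this article.
STEM_TO_ROOT = {"yurag": "yurak", "qulog‘": "quloq", "saylo": "sayla", "o‘rn": "o‘rin"}
DICTIONARY = {"yurak", "quloq", "sayla", "o‘rin", "zavq", "park"}


def split_word(word, suffixes=("im", "ing", "si", "i", "v")):
    """Step 1: find the root/suffix boundary by trying the longest suffix first."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            return word[: -len(suffix)], suffix
    return word, ""


def lemmatize(stem):
    """Step 2: replace an erroneous stem with a dictionary root when one is known."""
    if stem in DICTIONARY:
        return stem
    return STEM_TO_ROOT.get(stem, stem)


for form in ("yuragim", "zavqi", "saylov", "o‘rni"):
    stem, suffix = split_word(form)
    print(form, stem, lemmatize(stem))
```

Keeping the two steps separate means the boundary detection can stay purely rule-based, while all knowledge about sound changes lives in the lookup used by the second step.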


Orthographic rules 


When certain suffixes are added to some roots, one sound may alternate with another in the root or in the
suffix. These changes are reflected not only in pronunciation, but also in writing.


1. Metathesis in words 


Metathesis is considered to apply equally to both vowel and consonant sounds.

1.1. When the suffixes -v, -q, -qi are added to verbs ending in the vowel a, the vowel a is pronounced o and
is written as o:


root     suffix   word form   stem
sayla    -v       saylov      saylo
tara     -q       taroq       taro
sayra    -qi      sayroqi     sayro

1.2. When the suffixes -v, -q are added to most verbs ending in the vowel i, this vowel is pronounced u and
is written as u:

root     suffix   word form   stem
qazi     -v       qazuv       qazu
sovi     -q       sovuq       sovu
o‘qi     -v       o‘quv       o‘qu

Note. But when the suffix -q is added to certain verbs that end in the vowel i, the vowel i is pronounced and
written as i¹:


root, stem   suffix   word form
og‘ri        -q       og‘riq
qavi         -q       qaviq
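
Rules 1.1 and 1.2, together with the exception in the note, can be sketched as a small reverse mapping from the written stem vowel back to the root vowel. The names `VOWEL_RULES` and `restore_root` are hypothetical, and the sketch assumes the input is already known to be a form derived from a verb; applied blindly to arbitrary words ending in -q or -v it would give wrong results.

```python
# For each suffix: vowel written at the end of the stem -> vowel of the original root.
VOWEL_RULES = {
    "qi": {"o": "a"},
    "v": {"o": "a", "u": "i"},
    "q": {"o": "a", "u": "i"},
}


def restore_root(word):
    """Return (stem, root) for a word form ending in -v, -q or -qi."""
    for suffix in ("qi", "v", "q"):  # try the longer suffix -qi before -q
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            stem = word[: -len(suffix)]
            restored = VOWEL_RULES[suffix].get(stem[-1])
            # Stems ending in i (og‘ri, qavi) match no rule and are kept unchanged.
            root = stem[:-1] + restored if restored else stem
            return stem, root
    return word, word


print(restore_root("saylov"))  # ('saylo', 'sayla')
print(restore_root("sovuq"))   # ('sovu', 'sovi')
print(restore_root("qaviq"))   # ('qavi', 'qavi')
```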


1.3. When the possessive suffix is added to multi-syllable and mono-syllable words ending with the
consonants k, q, the consonant k becomes the consonant g, the consonant q becomes the consonant g‘, and the
word is written accordingly:


Root     Suffixes              Word form   Stem
yurak    {-im,-ing,-si,-i}     yuragim     yurag
quloq    {-im,-ing,-si,-i}     qulog‘im    qulog‘
yo‘q     {-im,-ing,-si,-i}     yo‘g‘im     yo‘g‘


Note. But when the possessive suffix is added to certain multi-syllable and mono-syllable words, the sounds
k, q are pronounced and written as in the original:


root, stem   suffix                Word form
zavq         {-im,-ing,-si,-i}     zavqi
park         {-im,-ing,-si,-i}     parki
nok          {-im,-ing,-si,-i}     noki
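
A minimal Python sketch of rule 1.3 is given below. It assumes a small set of known roots (`KNOWN_ROOTS`, hypothetical, filled from the examples above plus the word "barg", which genuinely ends in g) and accepts a restored k or q only when the result is a known root. The function name `strip_possessive` is also an assumption.

```python
POSSESSIVE = ("ing", "im", "si", "i")
# Hypothetical root list; "barg" is included to show a root that really ends in g.
KNOWN_ROOTS = {"yurak", "quloq", "yo‘q", "zavq", "park", "nok", "barg"}
# Written final consonant of the stem -> consonant of the original root.
VOICED_TO_ORIGINAL = (("g‘", "q"), ("g", "k"))


def strip_possessive(word):
    """Return (stem, root) after removing a possessive suffix."""
    for suffix in POSSESSIVE:
        if word.endswith(suffix) and len(word) > len(suffix) + 1:
            stem = word[: -len(suffix)]
            if stem in KNOWN_ROOTS:
                return stem, stem  # no sound change (zavq, park, nok)
            for voiced, original in VOICED_TO_ORIGINAL:
                if stem.endswith(voiced):
                    candidate = stem[: -len(voiced)] + original
                    if candidate in KNOWN_ROOTS:
                        return stem, candidate
            return stem, stem
    return word, word


print(strip_possessive("yuragim"))   # ('yurag', 'yurak')
print(strip_possessive("qulog‘im"))  # ('qulog‘', 'quloq')
print(strip_possessive("zavqi"))     # ('zavq', 'zavq')
```

The dictionary check is the design choice that matters here: without it, every stem ending in g would be rewritten, including roots where g belongs to the word itself.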


1.4. A sound change is observed when derivational suffixes such as -a, -ay, -la are added to some words²:


root     suffix   Word form   stem
ot       -a       ata         at
ong      -la      angla       ang
sariq    -ay      sarg‘ay     sarg‘


¹ https://github.com/KhZilola/Python-Codes/blob/main/Tovush%20almashishi.docx
² https://github.com/KhZilola/Python-Codes/blob/main/Istisnolar.docx
2. Assimilation in words


Assimilation has two different manifestations: one is reflected only in pronunciation, and the other is
expressed both in pronunciation and in writing. The phenomenon that manifests only in oral speech occurs
mainly in loanwords: rus → o‘ris, shkaf → ishkof, stakan → istakan, traktor → tiraktor.


The current Uzbek literary language textbook states that the changes observed in the above examples are
examples of the phenomenon of prosthesis [9]. This phenomenon is explained by the fact that in Turkic
languages, including Uzbek, a sequence of consonants at the beginning of a word is very rare; therefore, for
ease of pronunciation, a vowel is added and pronounced before them. There are several manifestations of
assimilation in Uzbek that have been thoroughly researched by linguists.


Table 2. Rules for the assimilation in the word in Uzbek language


Type of          How it happens                                        Examples
assimilation

Prosthesis       A phenomenon associated with the addition and         ro‘mol → [o‘ramol],
                 pronunciation of a single vowel at the beginning      ro‘za → [o‘raza],
                 of a word: before the sonorant [r], the vowel         rais → [o‘rais],
                 [o‘] is added.                                        rang → [o‘rang]

Epenthesis       When two consonants come in a row at the              fikr → [fikir],
                 beginning, middle or end of a word, the vowel [i],    hukm → [hukum],
                 sometimes [u] or [a], is added between them and       doklad → [dakalad],
                 pronounced.                                           klass → [kilass]


2.1. When the suffixes -da, -dan, -day, -dagi, -ga, -gacha, -cha are added to pronouns such as u, bu, shu,
o‘sha, the sound n is added and written:


root suffix Word form stem 
u -ga unga un 
bu -ga bunga bun 
shu -dagi shundagi shun 
shu -ga shunga shun 


2.2. When first-person and second-person possessive suffixes are added to the words parvo, obro‘, mavqe,
mavzu, avzo, the sound y is added and written:


root     suffix   Word form   stem
parvo    -im      parvoyim    parvoy
obro‘    -im      obro‘yim    obro‘y
mavqe    -im      mavqeyim    mavqey
mavzu    -im      mavzuyim    mavzuy
avzo     -im      avzoyim     avzoy


Note: The third-person possessive suffix is added to the words xudo, mavzu as -si, and no assimilation is
observed.
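
Rules 2.1 and 2.2 both insert one sound between the root and the suffix, so a stemmer has to recognize the inserted n or y instead of treating it as part of the root. The sketch below is an illustration only; the function name `strip_inserted_sound` and the constant names are hypothetical, and the word lists are limited to those given in the rules above.

```python
PRONOUNS = {"u", "bu", "shu", "o‘sha"}
PRONOUN_SUFFIXES = ("gacha", "dagi", "dan", "day", "cha", "da", "ga")
Y_WORDS = {"parvo", "obro‘", "mavqe", "mavzu", "avzo"}
POSSESSIVE_1_2 = ("imiz", "ingiz", "ing", "im")  # first and second person


def strip_inserted_sound(word):
    """Return (stem, root, suffix); the stem keeps the inserted n or y."""
    for suffixes, inserted, roots in (
        (PRONOUN_SUFFIXES, "n", PRONOUNS),
        (POSSESSIVE_1_2, "y", Y_WORDS),
    ):
        for suffix in suffixes:
            stem = word[: -len(suffix)]
            if word.endswith(suffix) and stem.endswith(inserted) and stem[:-1] in roots:
                return stem, stem[:-1], suffix
    return word, word, ""


print(strip_inserted_sound("unga"))      # ('un', 'u', 'ga')
print(strip_inserted_sound("shundagi"))  # ('shun', 'shu', 'dagi')
print(strip_inserted_sound("parvoyim"))  # ('parvoy', 'parvo', 'im')
```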
We can find some information about prosthesis in other literature. Adding sounds to the initial part of a word
is called prosthesis. In Turkic languages, including Uzbek, words borrowed from Russian are pronounced with
the vowels [i], [u] added before the consonant clusters at the beginning of the word:
stansiya → [istansiya], stol → [ustol]. But this phenomenon is decreasing [7].


In some cases, the vowel [i] can also be added when two consonants in a row, which are plosive and nasal,
come at the beginning of a word: shkaf → [ishkaf], spravka → [ispravka], stol → [istol], stul → [istul],
shtraf → [ishtraf].

3. Dissimilation in words 

Speech sounds are the material that makes up speech, and speech economizes on this material. This economy
appears as the dropping of sounds, that is, the phenomenon of dieresis [10].


3.1. When the possessive suffix is added to some words such as o‘rin, qorin, burun, o‘g‘il, bo‘yin, ko‘ngil,
the vowel in the second syllable is not pronounced and not written³:


root      suffix   Word form            stem
o‘rin     -i       o‘rni (o‘rini)       o‘rn
qorin     -i       qorni (qorini)       qorn
burun     -i       burni (buruni)       burn
o‘g‘il    -i       o‘g‘li (o‘g‘ili)     o‘g‘l
bo‘yin    -i       bo‘yni (bo‘yini)     bo‘yn
ko‘ngil   -i       ko‘ngli (ko‘ngili)   ko‘ngl


3.2. When the suffix -il is added to verbs such as qayir, ayir, the vowel in the second syllable is not
pronounced and not written:


root suffix Word form stem 
qayir -il qayril (qayiril) qayr 
ayir -il ayril (ayiril) ayr 


3.3. The vowel in the second syllable is not pronounced and not written when the suffixes -ov, -ala, -ovlon
are added to the words ikki, olti, yetti:


root     suffix   Word form                 stem
ikki     -ov      ikkov (ikkiov)            ikk
olti     -ala     oltala (oltiala)          olt
yetti    -ovlon   yettovlon (yettiovlon)    yett
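
Because the dropped vowel in rules 3.1 to 3.3 cannot be predicted from the stem alone (o‘rn comes from o‘rin, but burn from burun), one possible approach is to generate candidate roots and keep the one found in a dictionary. The sketch below assumes a hypothetical root set `ROOTS` containing only the words listed in these rules; `restore_dropped_vowel` is likewise an illustrative name.

```python
# Hypothetical root list containing only the words named in rules 3.1 to 3.3.
ROOTS = {
    "o‘rin", "qorin", "burun", "o‘g‘il", "bo‘yin", "ko‘ngil",
    "qayir", "ayir", "ikki", "olti", "yetti",
}
VOWELS = "aioue"


def restore_dropped_vowel(stem):
    """Try each vowel before the last letter or at the end; keep a known root."""
    if stem in ROOTS:
        return stem
    for vowel in VOWELS:
        for candidate in (stem[:-1] + vowel + stem[-1], stem + vowel):
            if candidate in ROOTS:
                return candidate
    return stem  # nothing matched: leave the stem unchanged


print(restore_dropped_vowel("o‘rn"))  # o‘rin
print(restore_dropped_vowel("burn"))  # burun
print(restore_dropped_vowel("ikk"))   # ikki
```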


3.4. When the suffixes -a, -ay are added to some words, the vowel in the second syllable is not pronounced
and not written:


root     suffix   Word form            stem
o‘yin    -a       o‘yna (o‘yina)       o‘yn
ulug‘    -ay      ulg‘ay (ulug‘ay)     ulg‘
sariq    -ay      sarg‘ay (sariqay)    sarg‘


Note. The phenomenon of metathesis is also observed in the word sarg‘ay, in the form sariq+ay=sarg‘ay.


³ https://github.com/KhZilola/Python-Codes/blob/main/Tovush%20tushishi.docx
3.5. When the suffixes -ni, -ning, -niki are added to the pronouns men, sen, the consonant n in the suffix is
not pronounced and not written:


root     suffix   Word form            stem
men      -ni      meni (menni)         men
sen      -ning    sening (senning)     sen
men      -niki    meniki (menniki)     men
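
Rule 3.5 affects only two pronouns and three suffixes, so a plain lookup is enough to recover the full suffix. The following is a minimal sketch with hypothetical names (`PRONOUN_FORMS`, `split_pronoun`), not the method used in the analyzer.

```python
# Full suffix -> what is actually written after the pronoun once n is dropped.
PRONOUN_FORMS = {"ni": "i", "ning": "ing", "niki": "iki"}


def split_pronoun(word):
    """Return (root, full_suffix) for forms such as meni, sening, meniki."""
    for root in ("men", "sen"):
        for full, written in PRONOUN_FORMS.items():
            if word == root + written:
                return root, full
    return word, ""


print(split_pronoun("meni"))    # ('men', 'ni')
print(split_pronoun("sening"))  # ('sen', 'ning')
print(split_pronoun("meniki"))  # ('men', 'niki')
```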


In different cases, different changes occur in the pronunciation of sounds. N. S. Trubetskoy shows that “...
practically it is impossible to pronounce exactly one sound in one position even several times clearly and in
one type” [11]. That is, the speaker pronounces words in different tones, both in different situations and in
the same situation. Therefore, each time the sounds of speech are pronounced, they have a different tone, with
different pitch and quality indicators. A single sound is never repeated exactly. The more deeply the
positional variations of sounds in speech are studied, the more types of sounds can be identified and the
broader the picture that can be drawn. In the “Stemming” part of the software of the morphological analyzer
of the Uzbek language⁴, word forms are analyzed, and some examples related to the phenomenon of sound
change are cited below.


[Figure: screenshots of the morphological analyzer ("Morfologik analizator") web interface. Text entered in
the input field ("Matnni kiriting"): saylov, taroq, sayroqi. yuragim, qulog‘im. zavqi, parki, noki. o‘rni,
burni, bo‘yni. meni, sening, meniki]

Analysis table recovered from the first screenshot:

No.   Word form   Lemma (part of speech)   Stem     Root (part of speech)   Base and suffixes
1     saylov      saylov (noun)            saylo    sayla (verb)            {saylov}
2     taroq       taroq (noun)             taro     tara (verb)             {taroq}
3     sayroqi     sayroqi (adjective)      sayro    sayra (verb)            {sayroqi}
4     yuragim     yurak (noun)             yurag    yurak (noun)            {yurak}-im
5     qulog‘im    quloq (noun)             qulog‘   quloq (noun)            {quloq}-im
6     zavqi       zavq (noun)              zavq     zavq (noun)             {zavq}-i
7     parki       park (noun)              park     park (noun)             {park}-i

Stemmer output recovered from the second screenshot:

saylo ({saylo}-v), yurag ({yurag}-im), qulog‘ ({qulog‘}-im), zavq ({zavq}-i), park ({park}-i), nok ({nok}-i),
o‘rn ({o‘rn}-i), burn ({burn}-i), bo‘yn ({bo‘yn}-i), men ({men}-i), sen ({sen}-ing), men ({men}-ik-i)


Figure 2. Morphological analyzer of Uzbek language (stemming)


⁴ http://uznatcorpara.uz/uz/Stemmer
Conclusion

This article analyzed the issue of sound change in a word when stemming the texts of the Uzbek language
corpus. Three different types of phonetic changes in certain Uzbek words, namely assimilation, dissimilation
and metathesis, were analyzed on the basis of grammatical rules and illustrated with examples. To solve the
problem of sound change in stem detection, the boundaries of the root and suffixes are determined in the
first step, and lemmatization is carried out in the second step. As a result of lemmatization, the erroneously
generated stems are changed to an existing root in the dictionary. The methods presented in the article were
applied to the morphological analyzer of the Uzbek language developed by B. Elov, R. Alaev, Sh. Khamroyeva
and Z. Khusainova.